A Bayesian Perspective on Generalization and Stochastic Gradient Descent
Authors
Abstract
We consider two questions at the heart of machine learning: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we identify the "noise scale" g = ε(N/B − 1) ≈ εN/B, where ε is the learning rate, N the training set size and B the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, Bopt ∝ εN. We verify these predictions empirically.
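To make the scaling rule concrete, here is a minimal Python sketch (not the authors' code; all numeric values are illustrative assumptions) that evaluates the noise scale g = ε(N/B − 1) ≈ εN/B and checks that increasing the learning rate and the batch size together leaves g roughly unchanged, which is the reasoning behind Bopt ∝ εN:

```python
# Minimal sketch (not the authors' code): evaluate the SGD "noise scale"
# g = eps * (N / B - 1) ~= eps * N / B from the abstract, and check that scaling
# the batch size linearly with the learning rate keeps g roughly constant.
# All numeric values are illustrative assumptions.

def noise_scale(eps, N, B):
    """Return the exact and approximate (valid for B << N) SGD noise scale."""
    return eps * (N / B - 1), eps * N / B

N, eps, B = 50_000, 0.1, 128           # assumed training set size, learning rate, batch size

print(noise_scale(eps, N, B))          # baseline noise scale
print(noise_scale(2 * eps, N, 2 * B))  # doubling eps and B together leaves g nearly unchanged
```

Holding the noise scale fixed in this way is what motivates growing the optimum batch size in proportion to both the learning rate and the training set size.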
Similar resources
A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent
We analyze the generalization error of randomized learning algorithms, focusing on stochastic gradient descent (SGD), using a novel combination of PAC-Bayes and algorithmic stability. Importantly, our risk bounds hold for all posterior distributions on the algorithm's random hyperparameters, including distributions that depend on the training data. This inspires an adaptive sampling...
Stochastic Gradient Descent as Approximate Bayesian Inference
Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distributi...
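As a rough illustration of the idea sketched in this abstract, the following Python toy example (a 1-D mean-estimation problem with illustrative hyperparameters, not the authors' setup) runs constant-learning-rate SGD and treats the post-burn-in iterates as draws from the chain's stationary distribution:

```python
# Toy sketch of constant-step SGD as an approximate posterior sampler
# (1-D Gaussian mean estimation; hyperparameters are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
N, B, lr, steps = 1_000, 32, 0.05, 5_000
data = rng.normal(loc=2.0, scale=1.0, size=N)     # observations with unknown mean

theta, samples = 0.0, []
for t in range(steps):
    batch = rng.choice(data, size=B, replace=False)
    grad = theta - batch.mean()                   # gradient of the average squared error
    theta -= lr * grad                            # constant-learning-rate SGD update
    if t > steps // 2:                            # drop burn-in, keep stationary iterates
        samples.append(theta)

# The iterates fluctuate around the minimum; their spread acts as a rough
# stand-in for posterior uncertainty, the quantity the tuning rules aim to match.
print("mean of iterates  :", np.mean(samples))
print("spread of iterates:", np.std(samples))
```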
Elastic Distributed Bayesian Collaborative Filtering
In this paper, we consider learning a Bayesian collaborative filtering model on a shared cluster of commodity machines. Two main challenges arise: (1) How can we parallelize and distribute Bayesian collaborative filtering? (2) How can our distributed inference system handle elasticity events common in a shared, resource managed cluster, including resource ramp-up, preemption, and stragglers? To...
Deep Learning: A Bayesian Perspective
Deep learning is a form of machine learning for nonlinear high dimensional pattern matching and prediction. By taking a Bayesian probabilistic perspective, we provide a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), redu...
Identification of Multiple Input-multiple Output Non-linear System Cement Rotary Kiln using Stochastic Gradient-based Rough-neural Network
Because of the existing interactions among the variables of a multiple input-multiple output (MIMO) nonlinear system, its identification is a difficult task, particularly in the presence of uncertainties. Cement rotary kiln (CRK) is a MIMO nonlinear system in the cement factory with a complicated mechanism and uncertain disturbances. The identification of CRK is very important for different pur...
Journal: CoRR
Volume: abs/1710.06451
Pages: -
Publication date: 2017